Conversation
Pull request overview
Adds a nightly (and manually dispatchable) GPU regression workflow that trains a model, runs a set of log-based sanity checks, converts a checkpoint, and runs inference—plus a few supporting config/doc tweaks.
Changes:
- Add a new GitHub Actions workflow to run GPU regression training + inference with log validators.
- Add helper scripts to validate training signals (loss drop, grad norm, grad sync, state dict keys).
- Reduce distributed log spam by gating `from_pretrained` prints to the main process; update docs/configs for CI usage.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/opentau/policies/pi05/modeling_pi05.py | Gate verbose loading/remapping prints to main process in distributed runs. |
| docs/source/tutorials/inference.rst | Update inference command to point at OpenTau's inference script. |
| configs/examples/accelerate_deepspeed_config.yaml | Adjust example accelerate config process count (used by regression workflow). |
| configs/dev/ci_config.json | Update CI training config to use pi05 + smaller action chunking and CI-specific settings. |
| .github/workflows/regression_test.yml | Add nightly GPU regression workflow (start runner, train, validate logs, convert, infer, stop runner). |
| .github/workflows/gpu_test.yml | Update GPU runner ASG name and reduce timeout. |
| .github/scripts/utils.py | Add shared `grep_file` helper for log parsing. |
| .github/scripts/check_state_keys.py | Add validator for missing/unexpected state dict keys in logs. |
| .github/scripts/check_nonzero_grad_norm.py | Add validator ensuring grad norm is present and non-zero. |
| .github/scripts/check_loss_drop.py | Add validator ensuring (smoothed) loss decreases and resume behavior is sane. |
| .github/scripts/check_accumulate_grad_sync.py | Add validator ensuring `accelerator.sync_gradients` matches grad accumulation cadence. |
```python
sync_grads = grep_file(arg.log_path, arg.re_pattern, processor=bool)
assert len(sync_grads) == arg.expected_length, (
    f"Expected {arg.expected_length} sync_gradients, found {len(sync_grads)} in {arg.log_path}."
)
assert all(sg == ((i + 1) % arg.gradient_accumulation_steps == 0) for i, sg in enumerate(sync_grads)), (
```
`processor=bool` will interpret both `"True"` and `"False"` strings as `True` (any non-empty string is truthy), so `sync_grads` will be incorrect and the assertion will fail or pass for the wrong reason. Convert explicitly from the captured string (e.g., map `"True"` -> `True` and `"False"` -> `False`) before running the pattern check.
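A minimal sketch of the suggested fix: a dedicated parser (the name `parse_bool` is hypothetical) that converts the captured strings explicitly and rejects anything else, instead of relying on `bool()` truthiness:

```python
def parse_bool(s: str) -> bool:
    """Parse a captured "True"/"False" log token into a real bool.

    bool("False") would be True because any non-empty string is truthy,
    so the strings must be compared explicitly.
    """
    if s == "True":
        return True
    if s == "False":
        return False
    raise ValueError(f"Unexpected sync_gradients value: {s!r}")
```

This would be passed as `processor=parse_bool` in the `grep_file` call above.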
```yaml
- name: Set up Libero Configs
  shell: bash
  run: |
    source .venv/bin/activate
    mkdir -p /tmp/libero-assets/libero/libero
    export LIBERO_CONFIG_PATH="$(pwd)/.github/assets/libero"
```
In GitHub Actions, `export LIBERO_CONFIG_PATH=...` inside a `run` step only affects that step's shell; it won't persist to later steps like "Run Training". If training/inference needs this env var, write it to `$GITHUB_ENV` (or set it under the job/step `env:`) so it's available in subsequent steps.
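A sketch of the suggested fix, reusing the step from the diff above and assuming later steps read `LIBERO_CONFIG_PATH` from the job environment:

```yaml
- name: Set up Libero Configs
  shell: bash
  run: |
    source .venv/bin/activate
    mkdir -p /tmp/libero-assets/libero/libero
    # Persist the variable for all subsequent steps via $GITHUB_ENV;
    # a plain `export` only lasts until this step's shell exits.
    echo "LIBERO_CONFIG_PATH=$(pwd)/.github/assets/libero" >> "$GITHUB_ENV"
```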
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
What this does
Runs GPU Regression Tests. The tests consist of training, resuming, and running inference on the model.
How it was tested
Ran the workflow on GitHub Actions; see https://github.com/TensorAuto/OpenTau/actions/runs/21304126877/job/61328345455?pr=85
How to checkout & try? (for the reviewer)
Dispatch the workflow manually from the Actions tab on GitHub.
Checklist
Note: Before submitting this PR, please read the contributor guidelines.